# Introduction
# This dataset has been populated after consolidating results of a survey that was rolled out to salaried professionals
# across various industries. The dataset is a financial dataset called "Data Professional Salary Survey Results" and contains
# responses across 3 years (2019, 2018, & 2017). For 2019 , 882 people responded from 46 different nations as stated on the
# data.world website. The total number of participants varied across 85 unique nations and anyone who wants to get a brief idea about how much do professionals like database administrators, analysts, architects, developers, and data scientists make annualy, then this dataset is pretty helpful.
# The dataset is pretty comprehensive and has considered various factors(column names) like the highest level of education of the individual, current designation/job title, whether the individual holds any certification, number of hours worked per week and many more. The dataset also has a column that asks for future plans of the individual and whether he/she is looking for a new job which is pretty interesting to kind of predict the happiness quotient of the individual.
# There are around 5 columns which contain null values in them namely PostalCode,OtherDatabases,EducationIsComputerRelated,
# OtherJobDuties and KindsOfTasksPerformed. A few of the column names are - 'Survey Year', 'Timestamp', 'SalaryUSD', 'Country',
# and there are a total of 29 columns. The total number of rows are 6893 and the different types of datatypes include -
# object(string/text/mixed numeric or non numeric values),int64(integer numbers) and datatime64(date and time values)
## Project Aim / Goal
# The main goal of our project is to show the various factors that affect the salary of IT professionals across the world. We identified some major factors/variables like
# 1) The number of hours worked per week by an indvidual<br>
# 2) The education level of an individual<br>
# 3) Gender<br>
# 4) Job Title<br>
# 5) Employment Sector<br>
# 6) Country in which the individual is working<br>
# 7) Years of experience in that particular job<br>
# 8) Employment Status<br>
# 9) Years of experience with a particular database
import numpy as np
import os
import pandas as pd # needed for reading CSV
import seaborn as sns #trying a new library to plot
import matplotlib.pyplot as plt # for common graph plotting
import plotly.express as px
import random
import matplotlib.lines as mlines
import matplotlib.transforms as mtransforms
import bqplot
import traitlets
import ipywidgets
from ipywidgets import interact
#importing an excel file and specifying header as fourth row since actual column names are represented there
# since index starts from 0 , header value = 3
df = pd.read_excel("C:/Users/rahul/Downloads/UIUC/Sem 1 - Fall 2019/Courses/Data Visualization/Final Project/2019_Data_Professional_Salary_Survey_Responses.xlsx",sheet_name='Salary Survey', header = 3)
# df.head()
#converting salary to float
df['SalaryUSD'] = pd.to_numeric(df['SalaryUSD'],errors='coerce')
figure_1 = px.scatter(df, x="HoursWorkedPerWeek", y="SalaryUSD", log_x=True,hover_name="JobTitle",hover_data=["YearsWithThisTypeOfJob"])
figure_1.update_layout(
title="Salary of Individuals vs Hours Worked Per Week",
xaxis_title="Hours Worked Per Week",
yaxis_title="Salary in USD",
font=dict(
family="Courier New, monospace",
size=18))
figure_1.show()
# The above visualization is a master visualization for our project which shows a scatter plot of the hours worked by professionals and their corresponding yearly salaries based on the professions. Looking at the Scatterplot we can see that the average number of hours that people work in this data set is between 40-50 hours per week and the average salary lies majorly in the range of 100k to 150k.
# There are a few outliers which we see as well where there are 2 DBA professionals working over 150 hours per week and a manager
# working for 200 hours. This is probably due to a wrong data entry by the individuals.
# The plot is interactive in nature as well where we can zoom in on an area to get a more indepth analysis about the professionals and their salaries. The tool tip gives additional information about the Job Title, Years of experience with the particular job , salary and the hours worked per week by the individual.
# There are three data points which have high salaries compared to others and the possible reason is the years of experience which is greater than 10 for each thus justifying the salary.
# The plots below show potential factors that can contribute to the salary of individuals and inferences observed from each of them.
# df['JobTitle'].value_counts()
plot_1 = df[df['JobTitle'].isin(["Engineer","Analyst","Manager","Developer: App code (C#, JS, etc)","Developer: Business Intelligence (SSRS, PowerBI, etc)","Architect","Developer: T-SQL","DBA","DBA (Production Focus - build & troubleshoot servers, HA/DR)","DBA (General - splits time evenly between writing & tuning queries AND building & troubleshooting servers)"])].groupby('JobTitle')['HoursWorkedPerWeek'].mean().plot(kind='bar',color="red")
plot_1.set_xlabel("JobTitle")
plot_1.set_ylabel("Average working Hours per week",color="red",fontsize=10)
plot_1.legend(["Avg. hours per week"],loc='best',fontsize=8)
plot_2=plot_1.twinx()
plt.rcParams["figure.dpi"] = 100
#ax2_sorted = ax.sort_values('SalaryUSD',ascending=False)
plot_2 = df[df['JobTitle'].isin(["Engineer","Analyst","Manager","Developer: App code (C#, JS, etc)","Developer: Business Intelligence (SSRS, PowerBI, etc)","Architect","Developer: T-SQL","DBA","DBA (Production Focus - build & troubleshoot servers, HA/DR)","DBA (General - splits time evenly between writing & tuning queries AND building & troubleshooting servers)"])].groupby('JobTitle')['SalaryUSD'].mean().plot(kind='bar',color="blue")
plot_2.set_ylabel("Average Salary",color="blue",fontsize=10)
plt.title('10 most popular Job Professions based on the average salary and average number of hours worked per week')
plot_2.legend(["Avg. Salary"],loc='best',fontsize=8)
plt.ylim(30, 200000)
plt.show()
# The above plot gives an overview of the 10 most popular professions as per the survey results and shows the average number of hours an individual needs to work for and the corresponding average salary he/she can expect. The 10 most popular jobs were chosen based on the job titles which had the maximum entry counts in the survey as listed in the cell above the visualization.
# A big limititation of the above plot is that the 10 professions listed may not necessarily be the top 10 in reality beacause if a future survey was done again, there could be many more people who would have roles like a consultant on database specialist and thus they could very well replace the current most popular job professions
Avg_salary = df.groupby('Education')['SalaryUSD'].mean()
edu_type = df['Education']
y_pos = np.arange(5)
salary = Avg_salary
plt.bar(y_pos, salary, align='center', color = "blue")
for x,y in zip(y_pos,salary):
label = "{:.2f}".format(y)
plt.annotate(label, # this is the text
(x,y), # this is the point to label
textcoords="offset points", # how to position the text
xytext=(0,10), # distance from text to points (x,y)
ha='center')
plt.xticks(y_pos, ('Associates', 'Bachelors', 'Doctorate/PhD', 'Masters', 'None'))
plt.ylabel('Average Salary')
plt.xlabel('Highest Education Obtained')
plt.title('Average Salary based on highest education level')
plt.rcParams["figure.dpi"] = 100
plt.ylim(85000, 107000)
plt.show()
salary_avg=df.groupby("Education").mean()
%matplotlib inline
salary_avg[["HoursWorkedPerWeek"]].plot.bar(align='center', color = "green")
plt.ylabel('Average Hours worked per week')
plt.xlabel('Education')
plt.title('Average hours worked per week based on education level')
plt.rcParams["figure.dpi"] = 100
salary_avg=df.groupby("Education").mean()
%matplotlib inline
salary_avg[["YearsWithThisTypeOfJob"]].plot.bar(align='center', color = "purple")
plt.ylabel('Average experience in years')
plt.xlabel('Education')
plt.title('Average experience in years based on education level')
plt.rcParams["figure.dpi"] = 100
import mplcursors
salary_avg=df.groupby("Education").mean()
%matplotlib inline
salary_avg[["YearsWithThisDatabase"]].plot.bar(align='center', color = "yellow")
plt.ylabel('Average experience with databases in years')
plt.xlabel('Education')
plt.title('Average experience with databases in years based on education level')
plt.rcParams["figure.dpi"] = 100
# In the graph titled "Average Salary based on highest education level" above we wanted to show the average salary of individuals based on their highest education level.
# It is surprising to see that individuals with a masters education are earning an average salary that is lower than those
# with bachelors education.One possible reason for this could be that there are only 1236 data entries for masters whereas
# bachelor education has over 3600 data entries. Another possible reason could be that countries with only 1-5 data entries
# may have individuals with a bachelors degree which thus siginficantly increases that average and paints a wrong picture
# However an interesting find was that the average number of hours worked per week and the average years of experience for both the masters and bachelors education level was almost the same and thus these factors probably do not explain the variance in the salary. But there was another factor which was concerned with the number of years of experience an individual had handling
# the primary databases in his/her organization and it was surprising to see that the average for bachelors was 35 years wheras for masters education was 10 years and this could probably be the reason for the higher average salary of bachelor education individuals.
Developer_sal =df[df['JobTitle'].isin(["DBA / BI Developer","Developer: Business Intelligence (SSRS, PowerBI, etc)","Developer: App code (C#, JS, etc)","Developer: T-SQL"])].groupby(['Survey Year'])['SalaryUSD'].mean()
year = df['Survey Year']
y_pos = np.arange(3)
salary = Developer_sal
plt.plot(y_pos, salary, color = "blue",marker = 'o')
for x,y in zip(y_pos,salary):
label = "{:.2f}".format(y)
plt.annotate(label, # this is the text
(x,y), # this is the point to label
textcoords="offset points", # how to position the text
xytext=(0,15), # distance from text to points (x,y)
ha='center',
va='top')
plt.xticks(y_pos,('2017', '2018', '2019'))
plt.ylabel('Average Salary')
plt.xlabel('Year')
plt.title('Average Yearly Salary for Developer roles')
plt.rcParams["figure.dpi"] = 150
plt.ylim(80000, 91000)
plt.show()
# We decided to show the trend of the yearly average salary of a specific job title which involved developer specific roles. From the above visualization we see that there has been a steady increase in the average salaries of developers and thus this may very well be a great profession for individuals to explore in the near future.
df1 = df[['SalaryUSD', 'Gender', 'JobTitle']].copy().dropna()#dropping null values
# print(df1)
#df1.query('Gender'=="Male")["SalaryUSD"]
m_sal = (df1[df1['Gender']=="Male"]['SalaryUSD'].values)
countm = len(m_sal)
#print(countm)
#print(m_sal)
f_sal = (df1[df1['Gender']=="Female"]['SalaryUSD'].values)
countf = len(f_sal)
#print(countf)
#print(f_sal)
f_sal = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="Analyst")]['SalaryUSD'].values)
#print(f_sal)
(np.mean(f_sal))
m_sal = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="Analyst")]['SalaryUSD'].values)
#print(m_sal)
(np.mean(m_sal))
f_arch = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="Architect")]['SalaryUSD'].values)
#print(f_arch)
#print(np.mean(f_arch))
m_arch = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="Architect")]['SalaryUSD'].values)
#print(m_arch)
#print(np.mean(m_arch))
f_ds = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="Data Scientist")]['SalaryUSD'].values)
#print(f_ds)
#print(np.mean(f_ds))
m_ds = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="Data Scientist")]['SalaryUSD'].values)
#print(m_ds)
#print(np.mean(m_ds))
f_dba1 = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="DBA (Development Focus - tunes queries, indexes, does deployments)")]['SalaryUSD'].values)
#print(f_dba1)
#print(np.mean(f_dba1))
m_dba1 = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="DBA (Development Focus - tunes queries, indexes, does deployments)")]['SalaryUSD'].values)
#print(m_dba1)
#print(np.mean(m_dba1))
f_dba2 = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="DBA (General - splits time evenly between writing & tuning queries AND building & troubleshooting servers)")]['SalaryUSD'].values)
#print(f_dba2)
#print(np.mean(f_dba2))
m_dba2 = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="DBA (General - splits time evenly between writing & tuning queries AND building & troubleshooting servers)")]['SalaryUSD'].values)
#print(m_dba2)
#print(np.mean(m_dba2))
f_dba3 = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="DBA (Production Focus - build & troubleshoot servers, HA/DR)")]['SalaryUSD'].values)
#print(f_dba3)
#print(np.mean(f_dba3))
m_dba3 = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="DBA (Production Focus - build & troubleshoot servers, HA/DR)")]['SalaryUSD'].values)
#print(m_dba3)
#print(np.mean(m_dba3))
f_dev1 = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="Developer: App code (C#, JS, etc)")]['SalaryUSD'].values)
#print(f_dev1)
#print(np.mean(f_dev1))
m_dev1 = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="Developer: App code (C#, JS, etc)")]['SalaryUSD'].values)
#print(m_dev1)
#print(np.mean(m_dev1))
f_dev2 = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="Developer: Business Intelligence (SSRS, PowerBI, etc)")]['SalaryUSD'].values)
#print(f_dev2)
#print(np.mean(f_dev2))
m_dev2 = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="Developer: Business Intelligence (SSRS, PowerBI, etc)")]['SalaryUSD'].values)
#print(m_dev2)
#print(np.mean(m_dev2))
f_dev3 = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="Developer: T-SQL")]['SalaryUSD'].values)
#print(f_dev3)
#print(np.mean(f_dev3))
m_dev3 = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="Developer: T-SQL")]['SalaryUSD'].values)
#print(m_dev3)
#print(np.mean(m_dev3))
#
f_eng = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="Engineer")]['SalaryUSD'].values)
#print(f_eng)
#print(np.mean(f_eng))
m_eng = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="Engineer")]['SalaryUSD'].values)
#print(m_eng)
#print(np.mean(m_eng))
f_mng = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="Manager")]['SalaryUSD'].values)
#print(f_mng)
#print(np.mean(f_mng))
m_mng = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="Manager")]['SalaryUSD'].values)
#print(m_mng)
#print(np.mean(m_mng))
f_other = (df1[(df1['Gender']=="Female") & (df1["JobTitle"]=="Other")]['SalaryUSD'].values)
#print(f_other)
#print(np.mean(f_other))
m_other = (df1[(df1['Gender']=="Male") & (df1["JobTitle"]=="Other")]['SalaryUSD'].values)
#print(m_other)
#print(np.mean(m_other))
# # initialize list of lists
# # please refer the cell above for more details on the derivation of numeric values in this data frame
data = [['Analyst', np.mean(m_sal), np.mean(f_sal)], ['Architect', np.mean(m_arch), np.mean(f_arch)], ['Data Scientist', np.mean(m_ds), np.mean(f_ds)], ['DBA 1', np.mean(m_dba1), np.mean(f_dba1)], ['DBA 2', np.mean(m_dba2), np.mean(f_dba2)], ['DBA 3', np.mean(m_dba3), np.mean(f_dba3)], ['Developer 1', np.mean(m_dev1), np.mean(f_dev1)], ['Developer 2', np.mean(m_dev2), np.mean(f_dev2)],
['Developer 3', np.mean(m_dev3), np.mean(f_dev3)], ['Engineer', np.mean(m_eng), np.mean(f_eng)], ['Manager', np.mean(m_mng), np.mean(f_mng)], ['Other', np.mean(m_other), np.mean(f_other)]]
# # Create the pandas DataFrame
df3 = pd.DataFrame(data, columns = ['JobTitle', 'MaleSal', 'FemaleSal'])
# # print dataframe.
#df3
plt.figure(figsize=(10,10))
ax = plt.axes()
x = np.linspace(70000, 120000, 1000)
plt.plot(x, x + 0, linestyle='solid', c="blue", label="Equal Wages")
plt.scatter(df3["MaleSal"], df3["FemaleSal"], c="orchid", s=100, label="Intersection of the mean salary in each job category")
for x,y in zip(df3["MaleSal"],df3["FemaleSal"]):
label = "{:.2f}".format(y)
plt.annotate(label, # this is the text
(x,y), # this is the point to label
textcoords="offset points", # how to position the text
xytext=(0,10), # distance from text to points (x,y)
ha='center')
plt.xlabel("Mean of Men's Salaries in Each Job Category (USD)")
plt.ylabel("Mean of Women's Salaries in Each Job Category (USD)")
plt.title('Relation Between Salaries of Men and Women')
z = np.polyfit(df3["MaleSal"], df3["FemaleSal"], 1)
p = np.poly1d(z)
plt.plot(df3["MaleSal"],p(df3["MaleSal"]),"r--", label="Trend Line")
plt.legend()
plt.tight_layout()
plt.show()
# This graph shows the disparity in salary among men and women. Each purple dot represents a job title and the dark blue diagonal line represents an ideal scenario where both men and women would have equal wages for a particular job title. Jobs that appear below the blue line, are those where women, on average, make less than men in a comparable profession.
# There are lesser jobs above or on that line by my count. It may also look like the higher the wage, the lesser the disparity.
# # We can infer that there are more number of jobs (data points) below the equal wages line as compared to on and above it.
# This suggests, there are more job categories which do not offer an equal pay to women as per the data of 2018-2019. Moreover, from the trend line we can infer that there is a positive correlation between the mean salaries of men and women in each job title, i.e the mean salaries of men are proportional to the mean salaries of women in a particular job category.
salary_avg=df.groupby("EmploymentSector").mean()
%matplotlib inline
salary_avg[["SalaryUSD"]].plot.bar(align='center', color = "red")
plt.ylabel('Average Salary in USD')
plt.xlabel('Employment Sector')
plt.title('Average Salary based for each Employment Sector')
plt.rcParams["figure.dpi"] = 100
# import mplcursors
# mplcursors.cursor(hover=True)
# The above visualization shows that the federal goverenment employment sector is most likely to earn the highest salary as per
# survey results over the last 3 fiscal years
from bqplot import *
from IPython.display import display
dropdown_1 = ipywidgets.Dropdown(
options=df.columns,
description='Field_1',
value='SalaryUSD',
disabled=False,
)
dropdown_2 = ipywidgets.Dropdown(
options=df.columns,
description='Field_2',
value='Education',
disabled=False,
)
x = df[dropdown_1.value]
y = df[dropdown_2.value]
x_sc = bqplot.LinearScale()
y_sc = bqplot.OrdinalScale()
x_ax = bqplot.Axis(scale = x_sc , label = 'x')
y_ax = bqplot.Axis(scale = y_sc , label = 'y',orientation = 'vertical')
scatter = bqplot.Scatter(x=x,y=y,scales = {'x': x_sc , 'y' : y_sc})
fig = bqplot.Figure(axes=[x_ax, y_ax], marks=[scatter])
def dropdown_1_eventhandler(change):
#print(change)
dropdown_1 = change['new']
scatter.x = df[dropdown_1]
def dropdown_2_eventhandler(change):
dropdown_2 = change['new']
scatter.y = df[dropdown_2]
dropdown_1.observe(dropdown_1_eventhandler, names="value")
dropdown_2.observe(dropdown_2_eventhandler, names="value")
#display(dropdown_1)
#display(dropdown_2)
#display(fig)
display(ipywidgets.VBox([dropdown_1, dropdown_2, fig]))
# The above visualization is just to show interactivity between different variables in the dataset. Some may not give a relavant inference but a combination of fields like the education of a professional with the salary earned can be depicted well in the visualization where we see professionals with a bachelors degree earning on the higher end.
plt.figure(figsize= (20,10))
plt.rcParams["figure.dpi"]=100
plt.title("Years of Experience in various Job Titles" , size = 40 , color = 'green')
sns.scatterplot(x='YearsWithThisTypeOfJob',y='JobTitle',hue='EmploymentStatus', data = df)
plt.legend(bbox_to_anchor=(1.05,1),loc=2, borderaxespad=0)
# The visualisation compares the job title with the experience possessed by the respective employee with the title and also brings in the kind of employment status she/he has.<br>
# We can observe that people working as an App code Developer or having a DBA related job title are quite highly experienced professionals whereas Data scientists and technicians on the other hand have limited individuals with less experience.<br>
# We observer a mix of numbers when comparing experience with job titles.Only a few people have a work experience beyond 30 years
#plt.show()
plt.figure(figsize= (20,20))
#.loc[row_indexer,col_indexer] = value instead
df_Sal = df
#df_Sal = df[df['Country'].isin(["United States", "United Kingdom","New Zealand","Canada","Australia", "India"])]
X1= df_Sal['SalaryUSD'].mean()
plt.suptitle("Employee salaries per country for different Employment types", fontsize=30);
#Y = (["United States", "United Kingdom", "New Zealand","Canada","Australia", "India" ])
sns.scatterplot(x='SalaryUSD', y="Country",hue='EmploymentStatus', data = df_Sal)
plt.legend(bbox_to_anchor=(1.05,1),loc=2, borderaxespad=0)
#X1
# The above visualisation compares the Salaries of various countries and gives us a general trend in terms of what salary numbers are paid by each country with respect to the Employment Status of its citizens.
# We can observe that the people who have the employment status as a full time employee on an average have the highest salaries in all the countries in general.
# We observe a mean salary of $92,814.64 amongst all the countries wherein majority of the individuals earn a salary that is less than the overall average as seen from the graph
# # References
# bqplot documentation<br>
# matplotlib documentation<br>
# stackoverflow<br>
# seaborn documentation<br>
# plotly documentation<br>
# ipywidget documentation<br>
# pandas documentation